Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address these challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we
collected to enable this study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance in terms of both quantitative measures and a user study. Comment: Published in IEEE Transactions on Multimedia
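As a concrete illustration of the context-aware embedding described above, here is a minimal PyTorch sketch of a residual bidirectional recurrent encoder; the GRU cell, feature dimensions, and residual fusion shown are illustrative assumptions rather than the authors' exact architecture, and the reinforcement-learned Narrator is omitted.

```python
# Minimal sketch, assuming 2048-d CNN clip features; not the paper's exact model.
import torch
import torch.nn as nn

class ResidualBiRNN(nn.Module):
    """Bidirectional GRU whose output is summed with a projected input
    (residual connection), so each clip embedding carries contextual
    information from both past and future clips."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 2 * hidden_dim)  # match BiGRU output width
        self.birnn = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, clip_feats):                # (batch, time, feat_dim)
        context, _ = self.birnn(clip_feats)       # (batch, time, 2*hidden_dim)
        return self.proj(clip_feats) + context    # residual fusion

# Usage: embed a video of 12 clips, each a 2048-d feature vector.
embeddings = ResidualBiRNN()(torch.randn(1, 12, 2048))  # -> (1, 12, 1024)
```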
The Meaning and Value of Ecological Conditions for the Socioeconomic Development of Tibet, with Special Attention to Ecotourism
As one of China's provincial-level autonomous regions, Tibet has a very long
history. At the same time, Tibet has been shrouded in a veil of mystery. Owing
to its unique geographical situation, the Tibetan economy has always lagged
behind, yet its socioeconomic development has also been a focus of the Chinese
government. This work introduces Tibet's ecological environment and historical
origins. Through a comparative analysis of data on Tibet's primary, secondary,
and tertiary sectors, we explore the importance and value of ecological
conditions for Tibet's social economy. Our purpose is to analyze the changes
that have taken place in Tibet's industrial structure and to examine how they
can promote Tibet's socioeconomic improvement. Máster en Relaciones
Internacionales y Estudios Asiáticos
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
Subject-driven text-to-image generation models create novel renditions of an
input subject based on text prompts. Existing models suffer from lengthy
fine-tuning and difficulty in preserving subject fidelity. To overcome these
limitations, we introduce BLIP-Diffusion, a new subject-driven image generation
model that supports multimodal control, consuming subject images and text
prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion
introduces a new multimodal encoder which is pre-trained to provide subject
representation. We first pre-train the multimodal encoder following BLIP-2 to
produce a visual representation aligned with the text. Then we design a subject
representation learning task that enables a diffusion model to leverage such
visual representation and generate new subject renditions. Compared with
previous methods such as DreamBooth, our model enables zero-shot subject-driven
generation and efficient fine-tuning for customized subjects with up to 20x
speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with
existing techniques such as ControlNet and prompt-to-prompt to enable novel
subject-driven generation and editing applications. Code and models will be
released at
https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project
page at https://dxli94.github.io/BLIP-Diffusion-website/
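To make the conditioning pathway concrete, here is a minimal PyTorch sketch of the general idea: subject tokens derived from a pooled subject-image feature are appended to the prompt embedding that the diffusion model cross-attends over. All class names and dimensions here are illustrative assumptions, not the LAVIS API; the actual pre-trained model lives at the repository linked above.

```python
# Minimal sketch of subject-token conditioning; names and sizes are assumed.
import torch
import torch.nn as nn

class SubjectConditioner(nn.Module):
    """Turns a pooled subject-image feature into learned subject tokens and
    appends them to the text-prompt embedding used for cross-attention."""
    def __init__(self, txt_dim=768, img_dim=1024, n_subject_tokens=16):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, txt_dim)   # stand-in for a BLIP-2-style encoder
        self.subject_queries = nn.Parameter(torch.randn(n_subject_tokens, txt_dim))
        self.attn = nn.MultiheadAttention(txt_dim, num_heads=8, batch_first=True)

    def forward(self, image_feat, prompt_emb):
        # image_feat: (B, img_dim) pooled subject-image feature
        # prompt_emb: (B, L, txt_dim) text-prompt embedding
        B = image_feat.size(0)
        kv = self.img_proj(image_feat).unsqueeze(1)               # (B, 1, txt_dim)
        q = self.subject_queries.unsqueeze(0).expand(B, -1, -1)   # (B, 16, txt_dim)
        subject_tokens, _ = self.attn(q, kv, kv)                  # attend queries to image
        return torch.cat([prompt_emb, subject_tokens], dim=1)     # conditioning sequence

# Usage: one pooled image feature plus a 77-token prompt embedding.
cond = SubjectConditioner()(torch.randn(2, 1024), torch.randn(2, 77, 768))
print(cond.shape)  # torch.Size([2, 93, 768])
```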
Learning Autonomous Ultrasound via Latent Task Representation and Robotic Skills Adaptation
As medical ultrasound is becoming a prevalent examination approach,
robotic ultrasound systems can facilitate the scanning process and spare
professional sonographers repetitive and tedious work. Despite the recent
progress, it is still a challenge to enable robots to autonomously accomplish
the ultrasound examination, which is largely due to the lack of a proper task
representation method, and also an adaptation approach to generalize learned
skills across different patients. To solve these problems, this paper proposes
latent task representation and robotic skills adaptation for autonomous
ultrasound. During the offline stage, the multimodal ultrasound
skills are merged and encapsulated into a low-dimensional probability model
through a fully self-supervised framework, which takes clinically demonstrated
ultrasound images, probe orientations, and contact forces into account. During
the online stage, the probability model selects and evaluates the optimal
prediction. For unstable singularities, an adaptive optimizer fine-tunes them
toward nearby, stable predictions in high-confidence regions. Experimental results
show that the proposed approach can generate complex ultrasound strategies for
diverse populations and achieve significantly better quantitative results than
our previous method.
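As a rough illustration of the offline/online split described above, the sketch below uses a Gaussian mixture as the low-dimensional probability model and fine-tunes low-confidence predictions toward high-confidence regions. The mixture model, the 8-d fused feature layout, and the confidence threshold are assumptions for illustration; the abstract does not specify them.

```python
# Minimal sketch, assuming a GMM skill model over an 8-d fused feature
# (image code + probe orientation + contact force); all placeholders.
import numpy as np
from scipy.optimize import minimize
from sklearn.mixture import GaussianMixture

# Offline stage: encapsulate demonstrated multimodal skills in a
# low-dimensional probability model.
demos = np.random.randn(500, 8)                  # stand-in for encoded demonstrations
model = GaussianMixture(n_components=5, random_state=0).fit(demos)

# Online stage: evaluate a candidate prediction; if its confidence
# (log-likelihood) is low, fine-tune it toward a high-confidence region.
def refine(candidate, threshold=-12.0):
    conf = model.score_samples(candidate[None])[0]
    if conf >= threshold:
        return candidate                          # already a stable prediction
    res = minimize(lambda x: -model.score_samples(x[None])[0], candidate)
    return res.x                                  # nearby, higher-confidence prediction

refined = refine(np.random.randn(8))
```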
PoseFusion: Robust Object-in-Hand Pose Estimation with SelectLSTM
Accurate estimation of the relative pose between an object and a robot hand
is critical for many manipulation tasks. However, most of the existing
object-in-hand pose datasets use two-finger grippers and also assume that the
object remains fixed in the hand without any relative movements, which is not
representative of real-world scenarios. To address this issue, a 6D
object-in-hand pose dataset is proposed using a teleoperation method with an
anthropomorphic Shadow Dexterous Hand. Our dataset comprises RGB-D images,
proprioceptive data, and tactile data, covering diverse grasping poses, finger
contact states, and object occlusions. To overcome the significant hand
occlusion and limited tactile sensor contact in real-world scenarios, we
propose PoseFusion, a hybrid multi-modal fusion approach that integrates the
information from visual and tactile perception channels. PoseFusion generates
three candidate object poses from three estimators (tactile only, visual only,
and visuo-tactile fusion), which are then filtered by a SelectLSTM network to
select the optimal pose, avoiding inferior fusion poses resulting from modality
collapse. Extensive experiments demonstrate the robustness and advantages of
our framework. All data and code are available on the project website:
https://elevenjiang1.github.io/ObjectInHand-Dataset
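The selection stage can be sketched as follows: an LSTM scores the three candidate poses per frame and picks one rather than averaging them, which is how inferior fused poses from modality collapse are avoided. Pose dimensionality and hidden sizes below are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch, assuming 7-d poses (3-d position + unit quaternion).
import torch
import torch.nn as nn

class SelectLSTM(nn.Module):
    """Scores the three candidate poses (tactile-only, visual-only,
    visuo-tactile) over time and selects one per frame."""
    def __init__(self, pose_dim=7, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(3 * pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)               # one logit per estimator

    def forward(self, candidates):                     # (B, T, 3, pose_dim)
        B, T, _, D = candidates.shape
        h, _ = self.lstm(candidates.reshape(B, T, 3 * D))
        choice = self.head(h).argmax(-1)               # (B, T) best-estimator index
        idx = choice[..., None, None].expand(B, T, 1, D)
        return torch.gather(candidates, 2, idx).squeeze(2)  # (B, T, pose_dim)

# Usage: 2 sequences of 10 frames, 3 candidate poses each.
best = SelectLSTM()(torch.randn(2, 10, 3, 7))          # -> (2, 10, 7)
```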